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The statistical problem of estimating the effective dimension- 
reduction (EDR) subspace in the multi-index regression model with 
deterministic design and additive noise is considered. A new proce- 
dure for recovering the directions of the EDR subspace is proposed. 
Under mild assumptions, y^-consistency of the proposed procedure 
is proved (up to a logarithmic factor) in the case when the struc- 
tural dimension is not larger than 4. The empirical behavior of the 
algorithm is studied through numerical simulations. 



1. Introduction. One of the most challenging problems in modern statis- 
tics is to find efficient methods for treating high-dimensional data sets. In 
various practical situations the problem of predicting or explaining a scalar 
response variable Y by d scalar predictors X^, . . . , arises. For solving 
this problem one should first specify an appropriate mathematical model and 
then find an algorithm for estimating that model based on the observed data. 
In the absence of a priori information on the relationship between Y and 
X = (X^\ . . . ,X^), complex models are to be preferred. Unfortunately, 
the accuracy of estimation is in general a decreasing function of the model 
complexity. For example, in the regression model with additive noise and 
two-times continuously differentiable regression function / : R d -> R, the 
most accurate estimators of / based on a sample of size n have a quadratic 
risk decreasing as n -4// ( 4+rf ) when n becomes large. This rate deteriorates 
very rapidly with increasing d leading to unsatisfactory accuracy of estima- 
tion for moderate sample sizes. This phenomenon is called "curse of dimen- 
sionality", the latter term being coined by Bellman (1961). 

To overcome the "curse of dimensionality", additional restrictions on the 
candidates / for describing the relationship between Y and X are necessary. 
One popular approach is to consider the multi-index model with m* indices: 
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for some linearly independent vectors i? m * and for some function 

g : W m — ► M, the relation f(x) = g(i!}Jx, . . . , fi^x) holds for every x G W 1 . 
Here and in the sequel the vectors are understood as one column matrices 
and Af T denotes the transpose of the matrix M. Of course, such a restriction 
is useful only if m* < d and the main argument in favor of using the multi- 
index model is that for most data sets the underlying structural dimension 
m* is substantially smaller than d. Therefore, if the vectors . . ., $ m * are 
known, the estimation of / reduces to the estimation of g, which can be 
performed much better because of lower dimensionality of the function g 
compared to that of /. 

Another advantage of the multi-index model is that it assesses that only 
few linear combinations of the predictors may suffice for "explaining" the 
response Y. Considering these combinations as new predictors leads to a 
much simpler model (due to its low dimensionality), which can be success- 
fully analyzed by graphical methods, see [12], [7] for more details. 

Thus, throughout this work we assume that we are given n observations 
(Yi,X\), . . . , (Y n ,X n ) from the model 

Y = fiX,) +si = g(#JXi, fiT.Xi) + £i , (1.1) 

where e±, . . . ,e n are unobserved errors assumed to be mutually independent 
zero mean random variables, independent of the design {X,, i < n}. 

Since it is unrealistic to assume that . . . , $ m * are known, estimation of 
these vectors from the data is of high practical interest. When the function 
g is unspecified, only the linear subspace S$ spanned by these vectors may 
be identified from the sample. This subspace is usually called index space or 
dimension-reduction (DR) subspace. Clearly, there are many DR subspaces 
for a fixed model /. Even if / is observed without error, only the smallest 
DR subspace, henceforth denoted by S, can be consistently identified. This 
smallest DR subspace, which is the intersection of all DR subspaces, is called 
effective dimension-reduction (EDR) subspace [18] or central mean subspace 
[8]. We adopt in this paper the former term, in order to be consistent with 
[15] and [22], which are the closest references to our work. 

The present work is devoted to studying a new algorithm for estimating the 
EDR subspace.We call it structural adaption via maximum minimization 
(SAMM). It can be regarded as a branch of the structure-adaptive (SA) 
approach introduced in [16], [15]. 

Note that a closely related problem is the estimation of the central subspace 
(CS), see [12] for its definition. For model (1.1) with i.i.d. predictors, the 
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CS coincides with the EDR subspace. Hence, all the methods developed for 
estimating the CS can potentially be applied in our set-up. We refer to [8] for 
background on the difference between the CS and the central mean subspace 
and to [10] for a discussion of the relationship between different algorithms 
estimating these subspaces. 

There are a number of methods providing an estimator of the EDR subspace 
in our set-up. These include ordinary least square [17], sliced inverse regres- 
sion [18], sliced inverse variance estimation [11], principal Hessian directions 
[19], graphical regression [7], parametric inverse regression [4], SA approach 
[15], iterative Hessian transformation [8], minimum average variance estima- 
tion (MAVE) [22] and minimum discrepancy approach [10]. 

All these methods, except SA approach and MAVE, are related to the prin- 
ciple of inverse regression (IR). Therefore they inherit its well known lim- 
itations. First, they require a hypothesis on the probabilistic structure of 
the predictors usually called linearity condition. Second, there is no theoret- 
ical justification guaranteeing that these methods estimate the whole EDR 
subspace and not just a part thereof (cf. [9, Section 3.1] and the comments 
on the third example in [15, Section 4]). In the same time, they have the 
advantage of being simple for implementation and for inference. 

The two other methods mentioned above - SA approach and MAVE - have 
much wider applicability including even time series analysis. The inference 
for these methods is more involved than that of IR based methods, but SA 
approach and MAVE are recognized to provide more accurate estimates of 
the EDR subspace. 

These arguments, combined with the empirical experience, indicate the com- 
plementarity of different methods designed to estimate the EDR subspace. 
It turns out that there is no procedure among those cited above that outper- 
forms all the others in plausible settings. Therefore, a reasonable strategy for 
estimating the EDR subspace is to execute different procedures and to take 
a decision after comparing the obtained results. In the case of strong con- 
tradictions, collecting additional data or using extra-statistical arguments is 
recommended. 

The algorithm SAMM we introduce here exploits the fact that the gradi- 
ent V/ of the regression function / evaluated at any point x G M. d be- 
longs to the EDR subspace. The estimation of the gradient being an ill- 
posed inverse problem, it is better to estimate some linear combinations of 
V/(Xi), . . . , V/(A n ), which still belong to the EDR subspace. 

Let L be a positive integer. The main idea leading to the algorithm proposed 
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in [15] is to iteratively estimate L linear combinations 0i, . . . ,0l of vectors 
Vf(Xi), . . . , V/(X n ) and then to recover the EDR subspace from the vec- 
tors Pi by running a principal component analysis (PC A). The resulting 
estimator is proved to be -y/n-consistent provided that L is chosen indepen- 
dently on the sample size n. Unfortunately, if L is small with respect to n, 
the subspace spanned by the vectors ,/3l may cover only a part of 

the EDR subspace. Therefore, the empirical experience advocates for large 
values of L, even if the desirable feature of y'n-consistency fails in this case. 

The estimator proposed in the present work is designed to provide a remedy 
for this dissension between the theory and the empirical experience. This 
goal is achieved by introducing a new method of extracting the EDR sub- 
space from the estimators of the vectors /3\, . . . , (5l- If we think of PC A as 
the solution to a minimization problem involving a sum over L terms (see 
(2.4) in the next section) then, to some extent, our proposal is to replace 
the sum by the maximum. This motivates the term structural adaptation 
via maximum minimization. The main advantage of SAMM is that it allows 
us to deal with the case when L increases polynomially in n and yields an 
estimator of the EDR subspace which is consistent under a very weak identi- 
fiability assumption. In addition, SAMM provides a -^/n-consistent estimator 
(up to a logarithmic factor) of the EDR subspace when m* < 4. 

If m* = 1, the corresponding model is referred to as single-index regression. 
There are many methods for estimating the EDR subspace in this case 
(see [23], [13] and the references therein). Note also that the methods for 
estimating the EDR subspace have often their counterparts in the partially 
linear regression analysis, see for example [21] and [6]. 

Some aspects of the application of dimension reduction techniques in bioin- 
formatics are studied in [1] and [5]. 

The rest of the paper is organized as follows. We review the structure- 
adaptive approach and introduce the SAMM procedure in Section 2. The- 
oretical features including i/n-consistency of the procedure are stated in 
Section 3. Section 4 contains an empirical study of the proposed procedure 
through Monte Carlo simulations. The technical proofs are deferred to the 
Appendix. 

2. Structural adaptation and SAMM. Introduced in [16], the structure- 
adaptive approach is based on two observations. First, knowing the struc- 
tural information helps better estimate the model function. Second, im- 
proved model estimation contributes to recovering more accurate structural 
information about the model. These advocate for the following iterative pro- 
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cedure. Start with the null structural information, then iterate the above- 
mentioned two steps (estimation of the model and extraction of the struc- 
ture) several times improving the quality of model estimation and increasing 
the accuracy of structural information during the iteration. 

2.1. Purely nonparametric local linear estimation. When no structural in- 
formation is available, one can only proceed in a fully nonparametric way. 
A proper estimation method is based on local linear smoothing (cf. [14] for 
more details): estimators of the function / and its gradient V/ at a point 
Xi are given by 

n 

arginf £ (Y- -a- c T X^f K^] 2 /b 2 ) 

{Sa)uW¥)}>u)*w. 

where Xij = Xj — Xi, b is a bandwidth and K(-) is a univariate kernel 
supported on [0, 1]. The bandwidth b should be selected so that the ball with 
the radius b and the center at the point of estimation Xi contains at least 
d + 1 design points. For large value of d this leads to a bandwidth of order 
one and to a large estimation bias. The goal of the structural adaptation 
is to diminish this bias using an iterative procedure exploiting the available 
estimated structural information. 

In order to transform these general observations to a concrete procedure, let 
us describe in the rest of this section how the knowledge of the structure 
can help to improve the quality of the estimation and how the structural 
information can be obtained when the function or its estimator is given. 

2.2. Model estimation when an estimator of S is available. Let us start 
with the case of known S. The function / has the same smoothness as g in 
the directions of the EDR subspace S spanned by the vectors i?i , . . . , $ m * , 
whereas it is constant (and therefore, infinitely smooth) in all the orthogonal 
directions. This suggests to apply an anisotropic bandwidth for estimating 
the model function and its gradient. The corresponding local-linear estima- 
tor can be defined by 

(e»)-{g(i)u) T <g<i)^' <*» 
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with the weights u?*- = K(\IL*Xij\ 2 /h 2 ), where h is some positive real num- 
ber and II* is the orthogonal projector onto the EDR subspace S. This choice 
of weights amounts to using infinite bandwidth in the directions lying in the 
orthogonal complement of the EDR subspace. 

If only an estimator S of S is available, the orthogonal projector II onto the 
subspace S may replace II* in the expression (2.1). This rule of defining the 
local neighborhoods is too stringent, since it definitely discards the directions 
belonging to S^-. Being not sure that our information about the structure 
is exact, it is preferable to define the neighborhoods in a softer way. This is 
done by setting Wij = K(Xjj(I + p~ 2 u)Xij /h 2 ) and by redefining 

Here, p is a real number from the interval [0, 1] measuring the importance 
attributed to the estimator II. If we are very confident in our estimator II, 
we should choose p close to zero. 

2.3. Recovering the EDR subspace from an estimator o/V/. Suppose first 
that the values of the function V/ at the points Xi are known. Then S is the 
linear subspace of M. d spanned by the vectors V/(Xj), i = 1, . . . , n. For clas- 
sifying the directions of M. d according to the variability of / in each direction 
and, as a by-product identifying S, the principal component analysis (PCA) 
can be used. 

The PCA method is based on the orthogonal decomposition of the matrix 
J( = n ~ l J2i=i Vf{Xi)Vf(Xi) T : JZ = OAO T with an orthogonal matrix O 
and a diagonal matrix A with diagonal entries Ai > A2 > . . . > A^. Clearly, 
for the multi-index model with m*-indices, only the first m* eigenvalues of 
j$ are positive. The first m* eigenvectors of ^ (or, equivalently, the first m* 
columns of the matrix O) define an orthonormal basis in the EDR subspace 

Let L be a positive integer. In [15], a "truncated" matrix is considered, 
which coincides with if L equals n. Let {ifj£, £ = 1, . . . , L} be a system of 
functions on M. d satisfying the conditions n" 1 J27=i V^(Xj)?/V(Aj) = 8^1 for 
every £, £' G {1, . . . , L}, with bi^i being the Kronecker symbol. Define 



n 

Pe = n- l Y,Vf(X i )MXi) 



(2.3) 
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and^L = J2e=i PiPj ■ By the Bessel inequality, it holds < More- 
over, since jfo ' J%l = -Ml^ ■, any eigenvector of j#t is an eigenvector of ,y#£. 
Finally, by the Parseval equality, = if L = n. Note that the estima- 
tion of has been treated in [20] . 

The reason of considering the matrix jtfti, instead of ^# is that can 
be estimated much better than In fact, estimators of j% have poor 
performance for samples of moderate size because of the sparsity of high 
dimensional data, ill-posedness of the gradient estimation and the non-linear 
dependence of j% on V/. On the other hand, estimation of j^i, reduces to 
the estimation of L linear functionals of V/ and may be done with a better 
accuracy. The obvious limitation of this approach is that it recovers the 
EDR subspace entirely only if the rank of coincides with the rank of 
which is equal to m* . To enhance our chances of seeing the condition 
rank(^x) = m* fulfilled, we have to choose L sufficiently large. In practice, 
L is chosen of the same order as n. 

In the case when only an estimator of V/ is available, the above described 
method of recovering the EDR directions from an estimator of have 
a risk of order y 1 Ljn (see [15, Theorem 5.1]). This fact advocates against 
using very large values of L. We desire nevertheless to use many linear 
combinations in order to increase our chances of capturing the whole EDR 
subspace. To this end, we modify the method of extracting the structural 
information from the estimators (3g of vectors /3g. 

Let m > m* be an integer. Observe that the estimator Yi m of the projector 
IT* based on the PCA solves the following quadratic optimization problem: 

minimize ^ $J(I - U)p e subject to LT 2 = II, trll < m, (2.4) 

i 

where the minimization is carried over the set of all symmetric (d x d)- 
matrices. The value m* can be estimated by looking how many eigenvalues 
of II m , are significant. Let A m be the set of (dx cf)-matrices defined as follows: 

A m = {n : n = n T , o ^ n ^ /, trn < m}. 

From now on, for two symmetric matrices A and B, A ^ B means that 
B — A is semidefinite positive. Define II m as a minimizer of the maximum 
of the $J(I — II)/%'s instead of their sum: 

n m E arginf max/37 (/ ~ n )^- ( 2 -5) 
neA m l 

This is a convex optimization problem that can be effectively solved even for 
a large d although a closed form solution is not known. Moreover, as we will 
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show below, the incorporation of (2.5) in the structural adaptation yields an 
algorithm having good theoretical and empirical performance. 

3. Theoretical features of SAMM. Throughout this section the true 
dimension m* of the EDR subspace is assumed to be known. Thus, we are 
given n observations (Y\,Xi), . . . , (Y n ,X n ) from the model 

Y % = f{X % ) +Ei = gitiJXi, . . . , ti^Xi) + e h 

where e±, . . . ,e n are independent centered random variables. The vectors i?j 
are assumed to form an orthonormal basis of the EDR subspace entailing 
thus the representation II* = J^^JLi^j^J ■ I n what follows, we mainly con- 
sider deterministic design. Nevertheless, the results hold in the case of ran- 
dom design as well, provided that the errors are independent of X%, . . . , X n . 
Henceforth, without loss of generality we assume that < 1 for any 
i = 1, . . . , n, where \v\ denotes the Euclidian norm of the vector v. 

3.1. Description of the algorithm. The structure-adaptive algorithm with 
maximum minimization consists of following steps. 

a) Specify positive real numbers o p , ah, pi and h\. Choose an integer L 
and select a set {ipe, i < L} of vectors from W 1 verifying \ipi\ 2 = n. 
Set k = 1. 

b) Initialize the parameters h = h±, p = p\ and IIo = 0. 

c) Define the estimators V/(Xj) for i = 1, . . . ,ra by formula (2.2) with 
the current values of h, p and II. Set 

1 n ^ 

& = -E V /PQ)Y>^ £ = 1,...,L, (3.1) 

where tpn is the zth coordinate of ipi. 

d) Define the new value 11^ as the solution to (2.5). 

e) Set pk+i = a p ■ pk, /ifc+i = a-h • hk and increase k by one. 

f) Stop if p < p m i n or h > /i max , otherwise continue with the step c). 

Let k(n) be the total number of iterations. The matrix n^( n ) is the desired 
estimator of the projector II*. We denote by II n the orthogonal projection 
onto the space spanned by the eigenvectors of n&( n ) corresponding to the m* 
largest eigenvalues. The estimator of the EDR subspace is then the image 
of II n . Equivalently, II n is the estimator of the projector onto S. 
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The described algorithm requires the specification of the parameters p\ , h\, 
a p and ah, as well as the choice of the set of vectors {ipe}. In what follows 
we use the values 

1, Pmin = n- 1 /^ m *\ a n = e-iAsvm*^ 



pi = p 



h = C7 n^/(4vd) 5 ^ = 2 ^ ah = e l/2(4vd)_ 

This choice of input parameters is up to some minor modifications the same 
as in [16], [15] and [21], and is based on the trade-off between the bias and 
the variance of estimation. It also takes into account the fact that the local 
neighborhoods used in (2.1) should contain enough design points to entail 
the consistency of the estimator. The choice of L and that of vectors ipn will 
be discussed in Section 4. 



3.2. Assumptions. Prior to stating rigorous theoretical results we need to 
introduce a set of assumptions. From now on, we use the notation / for the 
identity matrix of dimension d, \\A\\ 2 for the largest eigenvalue of A T ■ A and 
H-AH2 f or the sum °f squares of all elements of the matrix A. 

(Al) There exists a positive real C g such that |Vp(x)| < C g and \g(x) — 
g(x') — (x — x') T X7g(x)\ < C g \x — x'\ 2 for every x,x' € M m . 

Unlike the smoothness assumption, the assumptions on the identifiability of 
the model and the regularity of design are more involved and specific for 
each algorithm. The formal statements read as follows. 

(A2) Let the vectors (3 t G R d be defined by (2.3) and let B* = {(3 = 

J2e=i c iPt '■ J2e=i \ c t\ — !}■ There exist vectors - ■ ■ ,/3 m * 6 &* and 
constants fi± , . . . , p m * such that 

m* 

n* d E KM*- (3-2) 
fc=i 

We denote p* = p\ + . . . + pk- 

Remark 3.1. Assumption (A2) implies that the subspace S = Im(Jl*) is 
the smallest DR subspace, therefore it is the EDR subspace. Indeed, for any 
DR subspace S', the gradient Vf(Xi) belongs to S' for every i. Therefore 
Pe G S' for every £ < L and B* C S' . Thus, for every /3° from the orthogonal 
complement S ,JL , it holds \W j3°\ 2 < EkPk\P>kP°\ 2 = °- Therefore S' 1 - C S 1 - 
implying thus the inclusion S C S' . 
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Lemma 3.1. If the family spans W 1 , then assumption (A2) is always 
satisfied with some Hk (that may depend on n). 

Proof. Set * = (^i, ... , Vl) G ^ nxL , Yf = (V/(-Xi), . . . , V/(X n )) G M dx ™ 
and write the d x L matrix B = {(3\ , . . . , in the form Vf • ^. Recall that if 
Mi , M2 are two matrices such that M\ ■ M2 is well defined and the rank of Mi 
coincides with the number of lines in M2, then rank(Mi • M2) = rank(Mi). 
This implies that rank(£?) = m* provided that rank(^) = n, which amounts 
to span({V>4) = R n . 

Let now (3\, . . . , (3 m * be a linearly independent subfamily of I < L}. Then 
the m*th largest eigenvalue A m *(#) of the matrix^ = J2T=i PkPj is strictly 
positive. Moreover, if v±, ... ,v m * are the eigenvectors of M corresponding 
to the eigenvalues AiC#) > . . . > A m *C#) > 0, then 

m* to* m* 

n * = J2 v k v k ^ T — H AfeUfeUjf = A~L# = A~» 5^ /3 fc /3j. 
fc=i Am * fc=i fc=i 

Hence, (3.2) is fulfilled with = 1/A m *(^#) for every k = 1, . . . , m*. □ 

These arguments show that (A2) is a fairly weak identifiability assumption. 
In fact, since we always choose {ipi} so that span({^}) = M. n , (A2) amounts 
to requiring that the value ji* remains bounded when n increases. 



Let us proceed with the assumption on the design regularity. Define Pi 



k — 

(I+p^U*) 1 / 2 , Z\f = (hkP^Xij and for any dxd matrix U set w\f(U) = 

K{{z\?yuz\f), 4*\U) = K'{{z[fYUZ\f) : N?\U) = e s *%Hu) 
and 

v} k \u) = ±( z U)( z }k ) ) T 4\u). 

j=i \ ij ' \ y / 

(A3) For some positive constants Cy, Ck, Ck>, C w and for some a €]0, 1/2], 
the inequalities 

\\V^ k \ur l \\N\ k \u)<Cv, i = l,...,n, (3.3) 

n 

4? (U)M k) (U)<C K , j = l,...,n, (3.4) 



i=l 
n 



£ \4*\U)\/N?\U) < C K ,, j = 1, . . . ,n, (3.5) 



i=i 



£ \4j\u)\/W k Hu) <C W i = l,...,n, (3.6) 

3=1 
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hold for every k < k(n) and for every d x d matrix U verifying \\U — 
I\\ 2 < a. 

(A4) The errors {£i,i < n} are centered Gaussian with variance a 2 . 

3.3. Main result. We assume that the kernel K used in (2.2) is chosen to 
be continuous, positive and vanishing outside the interval [0, 1]. The vectors 
are assumed to verify 

max max \^a\ <4>, (3-7) 

l=l,...,L i=l,...,n ' 

for some constant ij) independent of n. In the sequel, we denote by C, C\, . . . 
some constants depending only on m*,fi*, C g , Cy, Ck, Ck>, C w and ty. 

Theorem 3.1. Assume that assumptions (Al)-(A^) are fulfilled. There 
exists a constant C > such that for any z e]0, 2>/log(nL)] and for suffi- 
ciently large values of n, it holds 

I 71 ^ ~~~ „ 2 9 2zcnJJFa \ , z 2 -i 3k(n) — 5 
P \/tr(J - n n )n* > Cn~3^t 2 n + , . < Lze — s~ + 



v/n(l - Cn)/ " n 



where Co = ^y/dCxCy , t n = 0(y / log(Ln)) and ( n = 0(t n n svm* ). 

Corollary 3.1. Under the assumptions of Theorem 3.1, for sufficiently 
large n, it holds 

„ 2 9 2J2n* zcna\ r z 2 -i 3k(n) — 5 

p n ra - n* 2 > Cn-^t 2 n + 7 / < Lze -— + — 



E(]|n n -TTh)<C[ n-V^hl + + ^ (3fc(n) " 5) . 

vn / n 



Proof. Easy algebra yields 

l|n n _ n* ||| = tr(n n - n*) 2 = tr n 2 - 2 tr n n n* + tr n* 

< trfi n + m* -2trfi n n* < 2m* - 2trfi n n*. 

The equality tr II* = m* and the linearity of the trace operator complete the 
proof of the first inequality. The second inequality can be derived from the 
first one by standard arguments in view of the inequality ||II n — ITU 2 , — 2m*. 

□ 
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These results assess that for m* < 4, the estimator of S provided by the 
SAMM procedure is y^-consistent up to a logarithmic factor. This rate 
of convergence is known to be optimal for a broad class of semiparametric 
problems, see [3] for a detailed account on the subject. 

Remark 3.2. The inspection of the proof of Theorem 3.1 shows that the 
factor multiplying the "bias" term n - 2 /(3Vm*) disappears when m* > 3. 

Remark 3.3. The same rate of convergence remains valid in the case when 
the errors are not necessarily identically distributed Gaussian random vari- 
ables, but have (uniformly in n) a bounded exponential moment. This can 
be proved along the lines of Proposition 4.3, see Appendix. 

Remark 3.4. Note that in (A3) we implicitly assumed that the matrices 

are invertible, which may be true only if any neighborhood E lyk \Xi) = 
{x : \(I + p~^ 2 Yi*)~ l / 2 (Xi — x)\ < /ifc} contains at least d design points 
different from JQ. The parameters hi, pi, a p and are chosen so that 
the volume of ellipsoids E^ k \Xi) is a non-decreasing function of k and 
Vol(E^ (Xi)) = Co/n. Therefore, from theoretical point of view, if the 
design is random with positive density on [0, l] d , it is easy to check that for 
a properly chosen constant Co, assumption (A3) is satisfied with a probabil- 
ity close to one. In applications, we define hi as the smallest real such that 
min^i,,,, n ^E^\Xi) = d + 1 and add to Vi a small full-rank matrix to be 
sure that the resulting matrix is invertible, see Section 4. 

Remark 3.5. In the case when m = n 1 ' d is integer, an example of deter- 
ministic design satisfying (A3) is as follows. Choose d functions h^ : [0, 1] — > 
[0,oo [ such that fnf[ 0i i] h) c (x) > and supr 0]1 i hk{x) < oo. Define the de- 
sign points {Xi} by {flj^™ hi(x) dx,... ll^ m h d (x) dx}, where ii,...,i d 
range over {0, . . . , m — 1}. This definition guarantees that the number of de- 
sign points lying in an ellipsoid E is asymptotically of the same order as 
nVol(E), as n — > oo. This suffices for (A3). Of course, it is unlikely to have 
such a design in practice, since even for small m and moderate d it leads to 
an unrealistically large sample size. 

4. Simulation results. The aim of this section is to demonstrate on sev- 
eral examples how the performance of the algorithm SAMM depends on the 
sample size n, the dimension d and the noise level a. We also show that 
our procedure can be successfully applied in autoregressive models. Many 
unreported results show that in most situations the performance of SAMM 
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is comparable to the performance of SA approach based on PCA and to that 
of MAVE. A thorough comparison of the numerical virtues of these meth- 
ods being out of scope of this paper, we simply show on some examples that 
SAMM may substantially outperform MAVE in the case of large "bias" . 

The computer code of the procedure SAMM is distributed freely, it can be 
downloaded from http://www.proba.jussieu.fr/pageperso/dalalyan/. It requires 
the MATLAB packages SDPT3 and Yalmip.We are grateful to Professor 
Yingcun Xia for making the computer code of MAVE available to us. 

To obtain higher stability of the algorithm, we preliminarily standardize 
the response Y and the predictors More precisely, we deal with Yi = 

Yi/ay and X = diag(S x )" 1/2 V, where Oy is the empirical variance of Y, 
Y*x is the empirical covariance matrix of X and diag(Sx) is the d x d 
matrix obtained from Sx by replacing the off-diagonal elements by zero. To 
preserve consistency, we set A jfc(n ) = diag(Sx)~ 1/2 /3^fc( n ), where h,k(n) is 
the last-step estimate of and define II fc ( n ) as the solution to (2.5) with (5i 
replaced by f3e,k(n)- Furthermore, we add the small full-rank matrix Id+i/n 

toE- =1 (iJ(iJ T ^m (2.2). 

In all examples presented below the number of replications is N = 250. The 
mean loss er/v = J2j eT j an d the standard deviation J jj J2j ( er j ~~ er^) 2 
are reported, where erj = ||n^) — II* || with Ivi) being the estimator of II* 
for jth replication. 



4.1. Choice of {ip£,l < L}. The set {ipi} plays an essential role in the 
algorithm. The optimal choice of this set is an important issue that needs 
further investigation. We content ourselves with giving one particular choice 
which agrees with theory and leads to nice empirical results. 

Let &j, j < d, be the permutation of the set {1, . . . , n} satisfying A^P(i) < 

. . . < IgL. Let S" 1 be the inverse of &j, i.e. &j(&J 1 (k)) = k for every 
k = 1, . . . , n. Define {i^i} as the set of vectors 

/ -MA-ljer 1 (l) .^(fc-ljer^n), 
I cos — — , . . . , cos — 



r , . . . , sm ± 



I ^ sin ( 

normalized to satisfy J27=i ^Ji = n f° r every i. It is easily seen that these 
vectors satisfy conditions (3.7) and span({^}) = M n , so the conclusion of 
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(a) (b) (c) 

Fig 1. (a) Average loss multiplied by y/n versus n for the first step (full line) and the final 
(dotted line) estimators provided by SAMM and for the estimator based on MAVE (broken 
line) in Example 1, (b) (resp. (c)) Average loss versus d (resp. a) for the first step (full 
line) and the final (dotted line) estimators provided by SAMM and for the estimator based 
on MAVE (broken line) in Example 2 (resp. Example 3). 

Lemma 3.1 holds. Above, [n/2] is the integer part of n/2 and k and j are 
positive integers. 

Example 1 (Single-index) . We set d = 5 and f(x) = g(i? T x) with 

g (t) = 4|t| 1/2 sin 2 (7rt), and ■& = (1/VE, 2/y/E, 0, 0, 0) T e M 5 . 
We run SAMM and MAVE procedures on the data generated by the model 

Y, = f(Xi) + 0.5 • e h 

where the design X is such that the coordinates (X^\j < 5, % < n) are i.i.d. 
uniform on [—1,1], and the errors £j are i.i.d. standard Gaussian independent 
of the design. 

Table 1 contains the average loss for different values of the sample size n for 
the first step estimator by SAMM, the final estimator provided by SAMM 
and the estimator based on MAVE. We plot in Figure 1 (a) the average loss 
normalized by the square rood of the sample size n versus n. It is clearly seen 
that the iterative procedure improves considerably the quality of estimation 
and that the final estimator provided by SAMM is -^/n-consistent. In this 
example, MAVE method often fails to recover the EDR subspace. However, 
the number of failures decreases very rapidly with increasing n. This is the 
reason why the curve corresponding to MAVE in Figure 1 (a) decreases with 
a strong slope. 
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Table 1 

Average loss ||II — II* || of the estimators obtained by SAMM and MAVE 
procedures in Example 1. The standard deviation is given in parentheses. 



n 


200 


300 


400 


600 


800 


SAMM, 1st 


0.443 


0.329 


0.271 


0.215 


0.155 




(.211) 


(.120) 


(.115) 


(.095) 


(.079) 


SAMM, Fnl 


0.337 


0.170 


0.116 


0.076 


0.053 




(.273) 


(.147) 


(.104) 


(.054) 


(.031) 


MAVE 


0.626 


0.455 


0.249 


0.154 


0.061 




(.363) 


(.408) 


(.342) 


(.290) 


(.161) 



Table 2 

Average loss ||II — II* || of the estimators obtained by SAMM and MAVE 
procedures'in Example 2. The standard deviation is given in parentheses. 



d 


4 


6 


8 


10 


12 


SAMM 1st 


0.154 


0.242 


0.296 


0.365 


0.421 




(.063) 


(.081) 


(.071) 


(.087) 


(.095) 


SAMM, Fnl 


0.028 


0.048 


0.060 


0.077 


0.098 




(.011) 


(.020) 


(.021) 


(.026) 


(.037) 


MAVE 


0.284 


0.607 


0.664 


0.681 


0.693 




(.147) 


(.073) 


(.052) 


(.054) 


(.044) 



Example 2. For d > 2 we set f(x) = g($ x) with 

g(x) = (xi - xl)(x\ + x 2 ); 

and -&X = (1, 0, . . . , 0) G R d , # 2 = (0, 1, . . . , 0) E R d . We run SAMM and 
MAVE procedures on the data generated by the model 

Y i = f(Xi)+ 0.1- e h i = l,...,300, 

where the design X is such that the coordinates (X^\j < d,i < n) are 
i.i.d. uniform on [—40,40], and the errors £j are i.i.d. standard Gaussian 
independent of the design. The results of simulations for different values of 
d are reported in Table 2. 

As expected, we found that (cf. Figure 1(b)) the quality of SAMM deterio- 
rated linearly in d as d increased. This agrees with our theoretical results. 
It should be noted that in this case MAVE fails to find the EDR space. 

Example 3. For d = 5 we set f(x) = g^ 1 x) with 



g{x) = (l + xi)(l + x 2 )(l + x 3 ) 
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Table 3 

Average loss \\H — II* || of the estimators obtained by SAMM and MAVE 
procedures in Example 3. The standard deviation is given in parentheses. 



a 200 150 100 50 25 10 
SAMM 1st 0.227 0.177 0.141 0.119 0.113 106 
(.092) (.075) (.055) (.051) (.048) (.043) 
SAMM, Fnl 0.125 0.084 0.057 0.039 0.034 0.03 
(.076) (.037) (.026) (.019) (.021) (.018) 
MAVE 0.103 0.087 0.073 0.062 0.063 0.059 
(.041) (.035) (.027) (.023) (.024) (.023) 



and i?i = (1,0,0,0,0), tf 2 = (0,1,0,0,0), d 3 = (0,0,1,0,0). We run SAMM 
and MAVE procedures on the data generated by the model 

Y i = f{X i ) + a-e i , i = l,...,250, 

where the design X is such that the coordinates (X^\j < d,i < n) are i.i.d. 
uniform on [0, 20], and the errors e, are i.i.d. standard Gaussian independent 
of the design. 

Figure 1(c) shows that the qualities of both SAMM and MAVE deteriorate 
linearly in a, when a increases. These results also demonstrate that, thanks 
to an efficient bias reduction, the SAMM procedure outperforms MAVE 
when stochastic error is small, whereas MAVE works better than SAMM in 
the case of dominating stochastic error (that is when a is large). 

Example 4 (time series). Let now T±, . . . , T n+ Q be generated by the autore- 
gressive model 

Ti+e = /(Tj + 5, Tj + 4, Tj + 3,Tj + 2,Tj + i,Tj) + 0.2 • £j, i = 1, . . . , n, 

with initial variables Ti,...,T$ being independent standard normal inde- 
pendent of the innovations which are i.i.d. standard normal as well. Let 
now f(x) = g($ T x) with 

g(x) = -1 + 0.6xi - cos(0.57rx 2 ) + e~ x z, 
t?! = (l,0,0,2,0,0)/v5, 
tf 2 = (0,0,l,0,0,2)/v5, 
tf 3 = (-2, 2, -2, 1,-1, 1)/Vl5. 

We run SAMM and MAVE procedures on the data (X h Y*), i = 1, . . . , 250, 
where Yj = Tj + 6 and Xj = (Tj, . . . , Tj + s) T . The results of simulations re- 
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Table 4 

Average loss ||II — II* || of the estimators obtained by SAMM and MAVE 
procedures in Example 4- The standard deviation is given in parentheses. 



n 


300 


400 


500 


600 


SAMM, 1st 


0.391 


0.351 


0.334 


0.293 




(.172) 


(.161) 


(.137) 


(.132) 


SAMM, Fnl 


0.220 


0.186 


0.174 


0.146 




(.119) 


(.123) 


(.102) 


(.089) 


MAVE 


0.268 


0.231 


0.209 


0.182 




(.209) 


(.170) 


(.159) 


(.122) 



ported in Table 4 show that the qualities of SAMM and MAVE are compa- 
rable, with SAMM being slightly more performant. 

Appendix. Since the proof of the main result is carried out in several 
steps, we give a short road map for guiding the reader throughout the proof. 
The main idea is to evaluate the accuracy of the first step estimators of 
and, given the accuracy of the estimator at the step k, evaluate the accuracy 
of the estimators at the step k + 1. This is done in Subsections 4.2 and 4.3. 
These results are based on a maximal inequality proved in Subsection 4.5 
and on some properties of the solution to (2.5) proved in Subsection 4.6. The 
proof of Theorem 3.1 is presented in Subsection 4.4, while some technical 
lemmas are postponed to Subsection 4.7. 

4.2. The accuracy of the first-step estimator. Since at the first step no in- 
formation about the EDR subspace is available, we use the same bandwidth 
in all directions, that is the local neighborhoods are balls (and not ellipsoids) 
of radius h. Therefore the first step estimator j3\ f t of the vector /3| is the 
same as the one used in [15]. 

PROPOSITION 4.1. Under assumptions (Al),(A3), (A4) and (3.7), for ev- 
ery i < L, 

\Px / -fa\<h 1 C g V2C^+-^=, 
where is a zero mean normal vector verifying E|^i^| 2 < 2da 2 CyCKw ■ 

Proof. Since at the first iteration we take S\ = I, the inequality |5iJQj| < 
h\ implies that |II*A"«| < {X^A < h\. Therefore the bias term \P*(Ej3\£ — 
(5i)\ is bounded by h\C g \[Cy (cf. the proof of Proposition 4.2). 
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For the stochastic term, we set £1^ = h\^/n{fi\^ — E(3\^). By Lemma 4.4, 
we have E\P^i g\ < da 2 CyCKw ■ The assertion of the proposition follows 
now from P{ = (I + p^ 2 W)- 1 / 2 y I/y/2. □ 

Corollary 4.1. If nL > 6 and the assertions of Proposition 4.1 hold, 
then 

P max /?! t - fa} > hiCgVCy H 7 — 1= < -■ 

V t h,i\/n J n 

Remark 4.1. In order that the kernel estimator of V/(x) be consistent, 
the ball centered at x with radius h\ should contain at least d points from 
{Xi,i = 1, . . . , n}. If the design is regular, this means that h\ is at least of 
order n~ l l d . The optimization of the risk of with respect to h\ verifying 
hi > n~ l / d leads to the choice h\ = Const. n~ 1 ^ Ayd > . 



4.3. One step improvement. At the kth step of iteration, we have at our 
disposal a symmetric matrix n £ Mdxd belonging to the set 

^(IT) = {n G M dxd : trU" < m*, * U * I, tr(/ - 11)11* < 5 2 } 

Thus the matrix II is the kth step approximation of the projector II* onto the 
EDR subspace S* . Using this approximation, we construct the new matrix 
II in the following way: Set Su, P = (I + P" 2 n) 1 /2 5 p U p = 5-1 anc i define 
the estimator of the regression function and its gradient at the design point 
Xj as follows: 



where Wij(U) = K{h 2 \Su, p Xij\ 2 ) and 

w-t(i)(i) T ^0D- 

To state the next result, we need some additional notation. Set Zij = 
(hP^Xij, U = P^^P; and U* = I, where P* p = P n *, P = (I - IT) + 
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,p;v/ n (x, 



hr 1 



ill np>) v -^ l (l *l>)tri(L)«*<m 



hP, 



P / 7=1 



n y s 



3=1 



where Wij(U) = K{Zj,U Zij) and 



Set Ni(U) = Ej Wij{U) and a = 25 2 p~ 2 + 25p~\ 

PROPOSITION 4.2. // (Al)-(A4) are fulfilled then there exist Gaussian vec- 
tors Q, ... ,C L e K d smc/i i/iat E[|Q| 2 ] < ego- 2 and 



P sup P;(Pi,n-0t) 



where the sup is taken over II € £ = 1,...,L and we used the notation 
t n = 5 + ,/31og(Ln) + \d? \ogn, c = ^^dC K C v and a = 30^(C 2 Cf C 2 K + 



c 2 v c 2 K ,y/ 2 . 



Proof. Let us start with evaluating the bias term \P*(EPe^u — /3e)\. Ac- 
cording to the Cauchy-Schwarz inequality, it holds 



|P;(E&,n-& 



n 



i=i 



l P P* (E[V/ n pQ)] - V/(Xi))| 2 £ ^ 2 (X0 



i=l 



< max |p;(E[V/nM]-V/(X0)| 

i=l,. ...n r 



20 



DALALYAN, JUDITSKY AND SPOKOINY 



Simple computations show that 
|^(Efe)]-V/(Xi))| 



< 



E 



P* P W(Xi) 



h- 1 



vr'JZnx^Dw^u) 

n 



p;vf(x i ) i 



v Z 



where r - = /(X,-) - /(X*) - XjV/pQ). Define Xj = h^n^w^U) and 
1/2/ 1 



«j = ^ 



Wij(U). Then 
V 2 \ "* \ 



< UK 



-1/2, 



?-7 



1/2 



The identity • u juj = Id+i implies 



TV 



-1/21 



3=1 



<r 2 maxr| VT 1 • 
< CyhT 2 maxr 2 .,- , 

where the maximum of r« is taken over the indices j satisfying io«(C/) 7^ 
0. Since the weights lOjj are defined via the kernel function K vanishing 
on the interval [l,oo[, we havemaxj = max{ry : 1 5"n,p^ij | < h}. By 
Corollary 4.3 \Su,pXij\ < h implies |II*Xy | < (p + S)h. Let us denote by 
the (dxm*) matrix having as kth column. Then IT = 0O T and therefore 

\r ij \ = \f{X j )-f{X i )-X] j Vf{X i )\ 

= \g(@ T X j )-g(@ r X i ) - (@ r X ij ) T Vg(@ T X i )\ 

< CglQ^X^ 2 <Cg{p+5fh 2 . 

These estimates yield |6(^Q) | < \JCy C g (p + S) 2 h, and consequently, 



|P;(E& >n -ft)| < max&pQ) < VC^C g (p + 5) 2 h. 



(4.1) 
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Let us treat now the stochastic term Pp($£ t n — /?*)■ It can be bounded as 
follows 



|P;(/%,n-E&,n)| < 



J2 C jA U ) e j 



where 



1 n / \ 

Let us define Q = h^/n P*((3i n* — E[/3^ n*])- In view of Lemma 4.4, we have 

One checks that for any I = 1, . . . , L and for any IT such that tr(I — 11)11* < 
5 2 , it holds 



P;(&,n-E[&, n ]) 



< sup 

\\U-U*\\ 3 <a 



Set a,j/(U) = Cj/(U) — Cj^{U*). Lemma 4.5 implies that Proposition 4.3 can 
be applied with kq = and k\ = . Setting e = 2a/ \fn we get that 
the probability of the event 



sup 



i=i 



^ cicra(5 + ^3 log(Lra) + 3d 2 \og(y/n)) j 



h^fn 



is less than 2/n. This completes the proof of the proposition. 



□ 



Corollary 4.2. If nL > 6 and the assumptions of Proposition 4.2 are 
fulfilled, then 



2 C7{ZCQ + Cl at ri )\ r _4=l 



P sup |P;(^,n - > VC>C 5 (p + 5)^ + 
In particular, if nL > 6, the probability of the event 



< Lze 2 . 



sup |P;(ft, n - > VCvCg(p + 5) z h + 



2 cr(2c Vlog(^n) + ciat n ) 



h^fn 



does not exceed 3/ra, where sup is taken over all II G ^(IT*), t = 1, . . . , L 
and cq, c±, t n are defined in Proposition 4.2 and in Theorem 3.1. 
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Proof. In view of Lemma 7 in [16] and Lemma 4.4 , we have 



P( max |^| > zc a) < J^P(|g| > zc a) < Lze~ {z2 ~ 1)/2 . 



The choice z = v /41og(nL) leads to the desired inequality provided that 
nL > 6. □ 



4.4. Proof of Theorem 3.1. Recall that at the first step we use the following 
values of parameters: Hq = 0, p\ = 1 and h\ = n -1 /( dv4 ). Let us denote 



ji = h 1 C g VC v ^ — -= , di=27iV^ 



and introduce the event = {max< — /%| < 71}. According to Corol- 
lary 4.1 the probability of the event Qi is at least 1 — n^ 1 . In view of 
Proposition 4.5, we get P(tr(J - U^U* < 5\) > 1 - n" 1 . 

For any integer k S [2,fc(n)] (where k{n) is the total number of iterations), 
we define 



7 7 2(5 fc _! / 5fe_x 

Pfc — OpPfe-l, "-fc — 0-hflk-l, Oik — 1- l 



k < k(n), 



Ik 



r> ,/n ( n 1 a \2i, 1 t7(2c 0v / log(nL) + cia fc t r; 

hk\/n 

CgVCy (Pk + ojfc-i) "it H r — 7= » k = k{n), 

hkV n 

Q k = 2p*{ 1 lpf + V2 lkP l l C g ) ) 

Sk = Zjk^/p*/ V 1 - Ck, 

n k = {max |P fe *(/3 M - < 7fc }. 



Here /3 M = ± £™ =1 V/ (fe) (Xi)^(*i) with 



lv/ w (x 

and „,« = ^(^KJ + p^n^O^Xyf ). 



3=1 x 7 V 7 7 J=l V 
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Combining Lemmas 4.6 and 4.7, we obtain P(tr(J — IIjt_i)n* > it%._i) < 
P(^fc-i) an d therefore, using Corollary 4.2, we get 

P(^) < P(max \Pk0k,£ -Pt)\> 7k, tr(I - fi fc _i)lT < Ci) + P(^fe-i) 



< P 



sup max |P fe *(4/ - > 7fe) + P(^-i) 



<-+P(^_ x ), k<k(n)-l. 



Since P(fif) < 1/n, it holds P(^ (n) _ 1 ) < (3k(n) - 5)/n and P(fi£ (n) ) < 
Lze _ ( z2_1 ^ 2 + 3fc (")~ 5 1 Lemma 4.7 implies that 



P(tr(J-n fc(n) )lT ><5 2 (n) ) <Lze-^- 1 V 2 + 



3k(n) 



n 



According to Lemma 4.6, we have <5fc( n )_ 2 < Pk(n)-l, a k(n)-i < 4 and 
Cfe(n)-1 — 1/2- Consequently, for n sufficiently large, we have 



v/l — Cfc(n)- 



and a fc(n) < 4<5 fc(n) _ 1 p- ( 1 n) < C^v/Io^L^p^^- 1 ) V n^vm*^ g 
/i fc(n) = 1 and (npfc^))" 1 < p 2 k{n) = n -2/(3Vm*) 5 we infer that 



nice 



7fc(n) = CgVCy (Pfe(n) + ^fc(n)-l) + 



2 r) -2/(3Vm*) _,_ CQ "^ 
71 



Therefore £ n := Ck(n) = ^(7fc(n)Pfc( n )) tends to zero as n — > oo at least as 
fast as ^/log(nL) n ~ 1 /( 6Vm ) and the assertion of the theorem follows from 
the definition of 5fc( n ) and Lemma 4.2 (see below). 



4.5. Maximal inequality. The following result contains a well known maxi- 
mal inequality for the maximum of a Gaussian process. We include its proof 
for the completeness of exposition. 
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Proposition 4.3. Let r be a positive number and let T be a finite set. Let 



functions a 



3rt 



i d obey the conditions 



sup sup | a,j n (u) | 2 < k , 

7GT |u— u*|<r 7=1 



sup sup sup 

7GT |u-u*|<r ee^d-i , =1 



(e T a jVr (u)) 



(4.2) 
(4.3) 



7/ the Ej 's are independent jV(0, a 2 ) -distributed random variables, then 



P( sup sup 

, 7Sr \u—u*\<r 



^a Ji7 (u)£j 

3=1 



> tan® + 2^paaK\e J < — , 

n 



where t = ^31og(|r|(2r/e)Pn). 



Proof. Let B r be the ball {u : \u — u*\ < r} C W and £ rje be the e-net 
on B r such that for any u £ B r there is an element ui S £ rj£ such that 
\u — ui\ < e. It \s easy to see that such a net with cardinality N r ^ < (2r/e) p 
can be constructed. For every u £ B r we denotery 7 (n) = Y^j=i a j,'y( u ) e j- 
Since E(|r/ 7 (u)| 2 ) < ct 2 ^ 2 , for any 7 and for any u, we have 



P(Mu,)l > tan ) < P(|^(tt,)| > t^Ed^^)! 2 )) < te-^-^l 2 . 



Thus we get 



N r . 



P(sup SUp > tcJK o) < X! X! P (l ? ?7( n ')| > tcrK 

v 7 Gr« ; 6S r , E 7 er z=i 

< WMrAe-V-W 2 . 



Hence, if t = ^3 log(|r| N r%e n), then p( sup 7gr sup U;6Er e |j? 7 (uz)| > £ctk ) < 
1/n. On the other hand, for any u, v! S B r the Cauchy-Schwarz inequality 
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yields 



\rj-y{u) — 77 7 (tt')| = sup |e T (r? 7 (it) — r/ 7 (V)) | 



< |ii — u'| 2 • sup sup 

ueB r e£Sd-i 

= |« — u I • sup sup 



d(e T ?7 7 ) 



(«) 



< I u — n' 1 2 • sup sup 



du 

d(e J dj n ) 



du 



(u) 



2 n 

£ 



^ 2i /|2 \ " 2 

j'=l 



Since P(X^y=i e | > Ana 2 ) is certainly less than n \ we have 



P( sup sup > to kq + 2\fnaK\e 



< Pfsup sup > x) + P ( sup sup KW~_7>^))I > l 

V 7er«;es r , E taK > v 7 er«eB r 1^Jn(jK\t 

1 n 2 

< - + Pf sup k\\u - ui(u)\ 2 ^T £ 2 > 4n<r 2 K 2 e 2 ) < -, 
n ^u&Br ^ y n 

and the assertion of proposition follows. □ 



4.6. Properties of the solution to (2.5). We collect below some simple facts 
concerning the solution to the optimization problem (2.5). By classical ar- 
guments, it is always possible to choose a measurable solution II to (2.5). 
This measurability will be assumed in the sequel. 

In Proposition 4.4 the case of general m (not necessarily equal to m*) is 
considered. As we explain below, this generality is useful for further de- 
velopments of the method extending it to the case of unknown structural 
dimension m*. 

The vectors (5i are assumed to belong to a m* -dimensional subspace S of 
R rf , but in this subsection we do not assume that fys are defined by (2.3). 
In fact, we will apply the results of this subsection to the vectors II* (3g. 
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Denote 

R(IL) =max/?7(/-n)/^, 
n{m) = min Jr^L) = ^R(U m ). 

We also define 

K*{m) = min max|(7 - U) 1 / 2 ^. 
ne.4 

and denote by IT^ a minimizer of max^ 0J (I — II) Pi over II 6 „4 m . Since for 
m > m* the projector n* is in A m , we have H* m = II* and 1Z*{m) = 0. 

Proposition 4.4. Let B* = {(3 = J2e c ePi '■ J2i M < 1} ^ e ^ e convex hull 
of vectors fy. //max; |/% — j3(\ < e, then 

1Z{m) < 1Z*{m)+e, 

max 1(7 - n m ) 1/2 /3| < ft*(m) + 2e. 

/3eB* 

When m < m* , we have also the lower bound 1Z(m) > (1Z*(m) — e) + . 
Proof. For every £ 6 1, . . . , L, we have 

|(/ - ii m ) l,2 k\ < \(i - n;j 1/2 A| + 1(/ - n;,) 1 ^ _ ^ 

< 7e*(m) + \p e - 0i\ < 11* (m) + e. 
Since II m minimizes maxf |(7 — II) 1 / 2 /^! over II £ .A m , we have 

max | (7 - n m ) 1 / 2 /3,| < max | (7 - rPJ 1 / 2 /^ < tt*(m) + e. 

Denote A = (7 — II. m ) 1//2 . From definition ^ A -< I. Therefore, for every i 

\A(3 e \ < \A$ t \ + \A{p t - $ t )\ < \A$ e \ + |/% - &| < ft*(m) + 2e. 

The second inequality of the proposition follows now from \A/3\ < max^ \A(3g\ 
for every (3 £ B*. 

To prove the last assertion, remark that according to the definition of 1Z* (m), 
for every matrix II £ «4 m there exists an index £ such that |(7 — IT) 1//2 /3^| > 
K*{m). In particular, |(7 - ft™) 1 / 2 /^ > 11* {m) for some I and hence | (7 — 
fl m ) 1/2 Pe\>\(I-fl m ) 1/2 Pe\-\Pe-Pe\>n*(m)-e. ' □ 
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Proposition 4.4 can be used for estimating the structural dimension m. In- 
deed, 7t(m) < e for m > m* and the results mean that 7Z(m) > (lZ*(m)— e)+ 
for m < m*. Therefore, it is natural to search for the smallest value mofm 
such that the function 7Z(m) does not significantly decrease for m > in. 

From now on, we assume that the structural dimension m* is known and 
write II instead of H m * . 

Proposition 4.5. If the vectors satisfy (A 2) and max f \ Pi — Pel < £, 
then tr(I - U)U* < Ae 2 fi* and tr[(fi - n*) 2 ] < 8eV- 

Proof. In view of the relations trll 2 < trfl < m* and tr(Il*) 2 = trlP = 
m* , we have 

tr(n - n*) 2 = tr(n 2 - n*) + 2tr(j - n)n* < 2| tr(/ - n)n*|. 

Note also that the equality tr(J - fl)n* = tr(J - U) 1 / 2 !!*^ - ft) 1 / 2 implies 
that tr(/ — II)n* > 0. Now condition (3.2) and Proposition 4.4 imply 

tr(/ - n)n* = tr(j - n) 1/2 n*(/ - n) 1/2 

m* 
k=l 

m* m* 

< vkPKi - ft)Pk < (2^) 2 E 

k=l k=l 

and the assertion follows. □ 
Lemma 4.1. Let tr(I — fl)n* < 5 2 for some 5 < 1. Then for any x GR d 

\n*x\ < |n 1/2 x| +s\x\. 

Proof. Denote A = ft 1 / 2 . It obviously holds \U*x\ < \U*Ax\ + \U*(I-A)x\ 
and 

|n*(/ - A)x\ 2 < \\U*(I - A)\\ 2 ■ \x\ 2 < tr[n*(7 - A) 2 U*] ■ \x\ 2 . 

For every n e A m , it obviously holds (/ - n 1 / 2 ) 2 = I- 2n x / 2 + II < I - U, 
and hence, tr U*(I - n x / 2 ) 2 n* < trn*(J - 11)11*. Therefore, 

trn* (i - i) 2 n* < trn*(/ - ri)n* = tr(i - ft)n* < s 2 

yielding |II*x| < |II* Aar| + S\x\ < \Ax\ + 5\x\ as required. □ 
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Corollary 4.3. Let p e (0,1), and S p ^ = {I + p^fl) 1 / 2 . If tr(/-fl)n* < 
5 2 , then for any x € the condition \S p x\ < h implies |n*x| < (p + S)h. 

Proof. The result follows from Lemma 4.1 and the obvious inequalities 
\x\ < \S p x\ < h and | IT 1 / 2 ^ | < p\S p x\ < ph. □ 

Lemma 4.2. Let tr(7 - n)IP < S 2 for some 5 G [0, 1[ and let U m * be 
the orthogonal projection matrix in R d onto the subspace spanned by the 
eigenvectors of II corresponding to its largest m* eigenvalues. Then tr(I — 

fim*)n* <s 2 /(i-5 2 ). 

Proof. Let Xj and #j, j = 1, . . . , d be respectively the eigenvalues and the 
eigenvectors of II. Assume that Ai > A2 > . • • > \d- Then LT = Y^=i ^j$j$J 
and ft m » = Y,f=i h&j- Moreover, Y$=\ = 1 since {■&!,..., $ d } is an 
orthonormal basis of M. d , Therefore, on the one hand, 

tr[nn*]< J2 Ajtr[^-i?Tn*] + A m . trft^Jn*] 

j<m* j>m* 

d 

= E (h - A m *) tr[7?^Tn*] + A m * tr [E^Jn* 

j<m* j=l 

= £ (Aj - A m .) tr[^Tn*] + m*A m *. 

j<m* 

Since tr^tfjn*] = |n*^| 2 < 1, we get tr[niP] < Ei<m* V Taking into 
account the relations J2j<d -\j < m * ; trlP = m* and (1 — A m *+i)(I— II m *) ^ 
I - ft, we get A m *+i < m* - J2j<m* - tr [U _ 9)11*] < 5 2 and therefore 
tr[(/ - 9 m *)n*] < 5 2 /(l - \ m * + i) < 5 2 /(l - 5 2 ). □ 

4.7. Technical lemmas. This subsection contains five technical results. The 
first three lemmas have been used in the proof of Proposition 4.2, whereas 
the two last lemmas have been used in the proof of Theorem 3.1. 

Lemma 4.3. If p < 1, then \\U - U*\\ 2 < a. 

Proof. The inequality P* ^ (I — II*) + pIT implies that 

P 2 \\u-u*\\ 2 = ||p;(n-n*)p;|| 2 

< p 2 ||n*(n - n*)n*|| 2 + n*)(n - n*)(i - n*)|| 2 
+ 2p||n*(n-n*)(i-n*)|| . 
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Since \\A\W = tr AA T < (tr(AA T ) 1/2 ) 2 for any matrix A, it holds 

||n*(n - n*)ir|| 2 = ||n*(j - n)n*|| 2 

< tr n*(i - n)ir = tr(J - n)ir < 5 2 . 



By similar arguments one checks that 

||(i-n*)(n - u*)(i- n*)|| 2 = ||(/-n*)n(i-n*)|| 2 < tr(j-n*)n 

= trn-trn* + trn*(/-n) 

< m* — m* + 5 2 , 

||n*(n - n*) (/ - n*)|| 2 < ||n*(n - n*)|| 2 = ||n*(j - n)|| 2 

< ||n*(j - n) 1/2 || 2 < (trn*(j - n)n*) 1/2 
= (tr(i-n)n*) 1 / 2 <<5. 

Thus we get ||£/-£/*|| 2 < 5 2 {l + p~ 2 ) + 25p~ 1 . The assumption p < 1 yields 
the assertion of the lemma. □ 

Lemma 4.4. Ifipts and U satisfy (A3) and (3.7), then 

dC K C v i? 



Em* 7 )! 2 < 



h 2 



n 



Proof. Simple computations yield 

2 



?J ri (i)l^ =tr(ri) - 



dCy 
Ni ' 



(4.4) 



Hence, we have 



E M 2 



1 ~ / i 



< 



j=i i=i 

2 n / n 



Wijipi(Xi) 



< 



h 2 n 2 . 

C K j> 2 n n 
h 2 n 2 



HE 



i=l 



I 1 



EE 

i=i i=i 



V; 



if 1 



1 



NiWij. 



NiWij 



Interchanging the order of summation and using inequality (4.4) we get the 
desired result. □ 



30 



DALALYAN, JUDITSKY AND SPOKOINY 



Lemma 4.5. If (A3) and (3.7) are fulfilled, then, for any j = 1, . . . ,n, 



sup sup 

u eeS d _i 



< c 



2 h 2 



+ 



2 h 2 



where C is a numerical constant and -gjj(e Cj t i)(U) is the dx d matrix with 

de T c j}l (U) 



entries 



dU v 



Proof. We have 



de T Cj^(U) 



dU 



< 2 



hn y-i dU 

i=i 



Wij{U)tlJi(Xi 



+ 2 



i=i 

A l + A 2 . 



Xfjrsf 1 \ d,Wij(U) 



Zi 



dU 



One checks that \\dwij(U)/dU\\2 = |^-(C/)| • \Zij\ 2 < 5|tt!^-(C/)|, where we 
used the notation w'^{U) = K' (Zj^U Z{j) and the inequality 



2 = \s* p Xij\ 2 = \(i- rr)A%f + 2p- 2 |n*x ii | 2 



/iiz 



< h 2 + 2(S/p + l) 2 h 2 < 5h 2 , 
which follows from Lemma 4.1. We get 



A 9 , < 



5O-0 2 



2 / n 



i=l 



< 



C^ 2 C 2 C 2 K , 



71 



2 h 2 



In order to estimate the term Ai, remark that the differentiation (with 
respect to U pq ) of the identity Vf l {U)Vi(U) = Id+i yields 

BVr 1 f)V- 



Simple computations show that 

dV,. 



dU, 



n 







da 



-Wij(U) 



pq 
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Hence, for any a\, 0,2 £ 



df7 

3=1 



This relation combined with the estimate < 5 for all i,j such that 

Wij 7^ 0, implies the norm estimate 



Tf>-1_ « / - \ / - \ T 



dU 



< 150|ai] laa] 5^ Hj(U)\ 

3=1 

< 150C^C^|ai| |a 2 |iVx(l7)" 1 . 

£>2 (^4 Q2 jp. 

It leads to the estimate Ai < C - Y lh f — , and the assertion of the lemma 



follows. □ 

Lemma 4.6. There exists an integer uq > such that, as soon as n> hq, 
Sk-i < Pfci Ofe < 4 and Cfc < 1/2 /or all k £ {2, ... , k(n)}. 

Proof. In view of Con _1 ^ dv4 ^ = p\h\ and Pki n )h k (n) > C^n -1 / 3 , the se- 
quence 

x /7-v, , , 4o-(c 0v /log(Lra) + cit ra ) 

V«Pfc(n)«fc(n) 

tends to zero as n — > 00. 

We do now an induction on k. S ince 5n — ^ as n — > 00 and 'yi ^ 5^ , the 
inequality 61 = 2ji^/p* < \j\[2 = pi/V2 is true for sufficiently large values 
of n. Let us prove the implication 

s <- „ / m — . / Ct - 1/2 ' 

Since l/\/2 < e _1//6 we infer that <5fc_i < and therefore a k < 4. By our 
choice of ah and a p , we have pi/ii > p k h k > p k ( n )h k ( n ). Therefore, 



lk < A^C gPk h k + ^Mlog(Ln) + Cl t n ) 



Pk Vnp k h k 



< WctcM + - MWlog(Ln) + Cltn) 



Vn~Pk(n)h k (n) 
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Thus, for n large enough, < 1/2 and 7^ < pfc/4. This implies that 5k = 
27* (1 " Cfc)" 1/2 < Pfe/V^- By induct ion we infer that Sk-i < pk-i/V? < Pk 
and (,k < 1/2 for any A; = 2, . . . , fc(n) — 1. This completes the proof of the 
lemma. □ 

Lemma 4.7. If k > 2 and Cfc-i < 1 ^en fj fe _i C {tr(J - n fc _i)II* < 5l_{\. 
Proof. Let us denote $£ = H* f3k-i,t, then 0£ G 5* and under Qk-i we have 

[mane \fi e - Pe\ < v^lk-i/ Pk-i- 

Set B = ES/Kftf and ^ = YT=iMj, where & = if ft = 

T,eCePe- Since £ £ M < 1, w e have \/3i\ < max e \fy\ < ||V/||oo and {fa- fa] < 
max; \Pi — Therefore 

m* _ _ _ _ 

\\B ~B\\<J2 ViWj - < M* max ||ftft T - 

i=i k 

<//max(|ft-ft| 2 + 2|ftHft-A 

< A**(27fe-iPfc-i + 2^27*-^^-! max |ft|) = Cfe-i 

and hence, for every unit vector v G 5*, v T Bv > {y^ Bv — \v T Bv — v T Bv\) > 
v T Bv- 1 1 £-£11 > 1 — Cfc-i- This inequality implies that n* ^ (l-Cfe-i) _1 £ 
and, in view of Proposition 4.5 we obtain the assertion of the lemma. □ 
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